TO WHOM IT MAY CONCERN, THE FOLLOWING IS 
A SPECIFICATION OF THE AFORESAID INVENTION 



METHOD FOR ELIMINATING AN UNWANTED SIGNAL FROM A MIXTURE 
VIA TIME-FREQUENCY MASKING 



BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to the field of audio and signal processing, and, 
more particularly, to eliminating an unwanted signal from a mixture of a desired signal 
and an unwanted signal. 

2. Description of the Related Art 

A voice sample can be a mixture of a desired signal and an unwanted signal. For 
example, the desired signal may be a voice, and the unwanted signal may be background 
music. If the background music is of a sufficient auditory level in relation to the auditory 
level of the voice, the desired signal may be masked by the background music such that 
the desired signal cannot be clearly understood. Therefore, it would be advantageous to 
eliminate or reduce the unwanted signal from the recording such that the desired signal 
can be more clearly understood. 

Classical techniques for eliminating an unwanted signal are the Widrow-Hoff 
techniques. The Widrow-Hoff techniques are prone to certain errors. They are sensitive to 
errors in phase estimates of a filter and an unwanted signal. They are also unreliable if a side 
signal and a mixture are not aligned properly. 

SUMMARY OF THE INVENTION 

In one aspect of the present invention, a method for eliminating or reducing an 
unwanted signal from a recorded mixture of a desired signal and an unwanted signal 
given an original recording of the unwanted signal is provided. The method includes 
aligning the recorded mixture and the original recording; computing a time-frequency 
representation of the recorded mixture to create a time-frequency recorded mixture; 
computing a time-frequency representation of the redefined original recording to create a 
time-frequency redefined original recording; determining a segment of time when only 
the redefined original recording is present in the recorded mixture; computing a value 
a(ω); generating a time-frequency mask using the value a(ω), the time-frequency 
recorded mixture and the time-frequency redefined original recording; applying the 
time-frequency mask on the recorded mixture to compute a time-frequency desired signal; and 
inverting the time-frequency desired signal to create a desired signal. 

In another aspect of the present invention, a machine-readable medium having 
instructions stored thereon for execution by a processor to perform a method for 
eliminating or reducing an unwanted signal from a recorded mixture of a desired signal 
and an unwanted signal given an original recording of the unwanted signal is provided. 
The medium contains instructions for aligning the recorded mixture and the original 
recording; computing a time-frequency representation of the recorded mixture to create a 
time-frequency recorded mixture; computing a time-frequency representation of the 
redefined original recording to create a time-frequency redefined original recording; 
determining a segment of time when only the redefined original recording is present in 
the recorded mixture; computing a value a(ω); generating a time-frequency mask using 
the value a(ω), the time-frequency recorded mixture and the time-frequency redefined 
original recording; applying the time-frequency mask on the recorded mixture to compute 
a time-frequency desired signal; and inverting the time-frequency desired signal to create 
a desired signal. 

In yet another embodiment of the present invention, a method for eliminating or 
reducing an unwanted signal from a recorded mixture of a desired signal and an 
unwanted signal given an original recording of the unwanted signal is provided. The 
method includes aligning the recorded mixture and the original recording; computing a 
time-scale representation of the recorded mixture to create a time-scale recorded mixture; 
computing a time-scale representation of the redefined original recording to create a 
time-scale redefined original recording; determining a segment of time when only the 
redefined original recording is present in the recorded mixture; computing a value a(ω); 
generating a time-scale mask using the value a(ω), the time-scale recorded mixture and 
the time-scale redefined original recording; applying the time-scale mask on the recorded 
mixture to compute a time-scale desired signal; and inverting the time-scale desired 
signal to create a desired signal. 

BRIEF DESCRIPTION OF THE DRAWINGS 
The invention may be understood by reference to the following description taken 
in conjunction with the accompanying drawings, in which like reference numerals 
identify like elements, and in which: 

FIG. 1 depicts a flow diagram of a method for eliminating or reducing an 
unwanted signal, in accordance with one illustrative embodiment of the present 
invention; 

FIG. 2 depicts a pictorial time domain representation of a mixture x and an 
unwanted signal r_0, in accordance with one illustrative embodiment of the present 
invention; 

FIG. 3 depicts a pictorial time domain representation of the mixture x and the 
unwanted signal r_0 of FIG. 2, further illustrating a delay between the mixture x and the 
unwanted signal r_0, in accordance with one illustrative embodiment of the present 
invention; 

FIG. 4 depicts a pictorial time domain representation of the unwanted signal r_0 of 
FIG. 2 and FIG. 3 and a redefined unwanted signal r_1, in accordance with one illustrative 
embodiment of the present invention; 

FIG. 5 depicts a pictorial time-frequency representation of the mixture x and the 
redefined unwanted signal r_1, in accordance with one illustrative embodiment of the 
present invention; 

FIG. 6 depicts a pictorial time domain representation of the mixture x of FIG. 2 
and FIG. 3 and the redefined unwanted signal r_1 of FIG. 4, further illustrating a time 
segment when only the redefined unwanted signal r_1 is present, in accordance with one 
illustrative embodiment of the present invention; 

FIG. 7 depicts a pictorial time-frequency representation of the mixture x and the 
redefined unwanted signal r_1 of FIG. 5, further illustrating a(ω), in accordance with one 
illustrative embodiment of the present invention; 

FIG. 8 depicts a pictorial representation of a time-frequency mask, in accordance 
with one illustrative embodiment of the present invention; 

FIG. 9 depicts a pictorial time-frequency representation of the mixture x of FIG. 5 
and FIG. 7 after the time-frequency mask of FIG. 8 is applied, in accordance with one 
illustrative embodiment of the present invention; and 

FIG. 10 depicts a time domain representation of a desired signal of the mixture x 
of FIG. 2, FIG. 3, and FIG. 6, in accordance with one illustrative embodiment of the 
present invention. 

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 
Illustrative embodiments of the invention are described below. In the interest of 
clarity, not all features of an actual implementation are described in this specification. It 
will of course be appreciated that in the development of any such actual embodiment, 
numerous implementation-specific decisions must be made to achieve the developers' 
specific goals, such as compliance with system-related and business-related constraints, 
which will vary from one implementation to another. Moreover, it will be appreciated 
that such a development effort might be complex and time-consuming, but would 
nevertheless be a routine undertaking for those of ordinary skill in the art having the 
benefit of this disclosure. 

While the invention is susceptible to various modifications and alternative forms, 
specific embodiments thereof have been shown by way of example in the drawings and 
are herein described in detail. It should be understood, however, that the description 
herein of specific embodiments is not intended to limit the invention to the particular 
forms disclosed, but on the contrary, the intention is to cover all modifications, 
equivalents, and alternatives falling within the spirit and scope of the invention as defined 
by the appended claims. 

It is to be understood that the systems and methods described herein may be 
implemented in various forms of hardware, software, firmware, special purpose 
processors, or a combination thereof. In particular, at least a portion of the present 
invention is preferably implemented as an application comprising program instructions 
that are tangibly embodied on one or more program storage devices (e.g., hard disk, 
magnetic floppy disk, RAM, ROM, CD ROM, etc.) and executable by any device or 
machine comprising suitable architecture, such as a general purpose digital computer 
having a processor, memory, and input/output interfaces. It is to be further understood 
that, because some of the constituent system components and process steps depicted in 
the accompanying Figures are preferably implemented in software, the connections 
between system modules (or the logic flow of method steps) may differ depending upon 
the manner in which the present invention is programmed. Given the teachings herein, one 
of ordinary skill in the related art will be able to contemplate these and similar 
implementations of the present invention. 

A method is presented for eliminating an unwanted signal (e.g., background 
music, interference, etc.) from a mixture of a desired signal and the unwanted signal via 
time-frequency masking. Given a mixture of the desired signal and the unwanted signal, 
the goal of the present invention is to eliminate or at least reduce the effects of the 
unwanted signal to obtain an estimate of the desired signal. For example, although not 
so limited, the desired signal can be voice and the unwanted signal could be music. The 
goal, therefore, would be to eliminate or at least reduce the music from the mixture. 

The method requires a side information signal, which is a signal whose 
instantaneous spectral powers are related to those of the unwanted signal. Such a signal 
is often available. 
For example, in the scenario where the unwanted signal is music from a digital recording 
(e.g., a compact disc) or an analog recording (e.g., a cassette tape), the original digital or 
analog recording can serve as the side information signal. 

The method comprises three general steps, which are further elaborated through 
the present disclosure. First, the mixture and the side information signal are roughly 
aligned so that sounds in each occur approximately at the same time. Second, an estimate 
of the relationship (i.e., spectral weights) between the instantaneous spectral powers of 
the side information signal and its presence in the mixture is computed using a section of 
the mixture which contains little to no contribution from the desired signal but a 
relatively large contribution from the unwanted signal. Third, a time-frequency mask is 
created by comparing the weighted instantaneous spectral powers of the side information 
signal to the instantaneous spectral powers of the mixture. Time-frequency points which are 
likely dominated by the unwanted signal are suppressed to remove the unwanted signal 
from the mixture. The result is a clearer desired signal. 

Consider a recording of a mixture of a desired signal, s(t), and an unwanted 
signal, r(t), 

x(t)=s(t)+r(t). 

Although the present invention is not so limited, it is assumed solely for discussion 
purposes that the desired signal is voice and the unwanted signal is music. It is further 
assumed that the music signal in the recording was played on a stereo or the like, and that 
the original recording (i.e., the side information signal) is available, for example in the 
form of a cassette tape or compact disc. The original recording can be referred to as r_0(t). 
The unwanted signal r(t) and original recording version r_0(t) are clearly related, although 
in general r(t) ≠ r_0(t) because r(t) has been altered by the recording process, as is known 
to those skilled in the art. That is, r(t) is a filtered version of r_0(t) and this transforming 
filter is unknown. The goal of the present invention is to estimate s(t) given x(t) and r_0(t). 

The mixing in the time-frequency domain can be expressed using the windowed 
Fourier transform. The windowed Fourier transform of x is defined, 

$$F^{W}[x](t,\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} W(\tau - t)\, x(\tau)\, e^{-i\omega\tau}\, d\tau,$$

which is referred to as x(t, ω); W denotes the window function. The mixture in the 
time-frequency domain is expressed, 

$$x(t,\omega) = s(t,\omega) + r(t,\omega).$$

It is assumed that a filter process can be modeled as r(t, ω) = h(ω) r_0(t, ω), such that 
mixing is, 

$$x(t,\omega) = s(t,\omega) + h(\omega)\, r_0(t,\omega).$$
A time-frequency mask, m(t, ω), is created such that the mask preserves most of the 
desired source power, 

$$\|m(t,\omega)\, s(t,\omega)\|^2 \approx \|s(t,\omega)\|^2,$$

and results in a high output signal to interference ratio, 

$$\|m(t,\omega)\, s(t,\omega)\|^2 \gg \|m(t,\omega)\, r(t,\omega)\|^2.$$
For such a mask, converting m(t, ω) x(t, ω) back into the time domain will create the 
desired signal, s(t). Thus, the goal of estimating s(t) can be achieved by determining 
an appropriate time-frequency mask m(t, ω). 

In one embodiment, the method described herein can be performed with the 
following steps: 

1. Obtaining a mixture x(t) and a related side information signal r_0(t). 

2. Aligning x(t) and r_0(t) using a suitable alignment technique known to those 
skilled in the art, such as manual or correlation-based alignment. 

3. Computing time-frequency representations x(t, ω) and r_0(t, ω). 

4. Locating a portion of x(t) which is dominated by r(t). That is, finding a 
range of t ∈ (t_0, t_1) such that x(t) ≈ r(t) for t in this range. 

5. Estimating |h(ω)| (i.e., a filter) via, 

$$a(\omega) = \frac{\int_{t \in (t_0, t_1)} |x(t,\omega)\, r_0(t,\omega)|\, dt}{\int_{t \in (t_0, t_1)} |r_0(t,\omega)|^2\, dt}.$$

6. Generating a time-frequency mask, 

$$m(t,\omega) = \begin{cases} 1 & \text{if } \dfrac{|x(t,\omega)|}{a(\omega)\,|r_0(t,\omega)|} > \alpha \\ 0 & \text{otherwise} \end{cases}$$

where α is set to maximize intelligibility. Although not so limited, a default 
value can be α = 2. 

7. Applying the mask to the mixture and converting the result, m(t, ω) x(t, ω), 
back into the time domain. 
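
By way of illustration only, the following short derivation indicates why the estimate of step 5 recovers the modulus of the filter. It assumes the model stated above, r(t, ω) = h(ω) r_0(t, ω), with r_0 denoting the aligned side information signal, and further assumes that the desired signal is negligible on the interval (t_0, t_1), so that x(t, ω) ≈ h(ω) r_0(t, ω) there:

$$a(\omega) = \frac{\int_{t_0}^{t_1} |x(t,\omega)\, r_0(t,\omega)|\, dt}{\int_{t_0}^{t_1} |r_0(t,\omega)|^2\, dt} \approx \frac{|h(\omega)| \int_{t_0}^{t_1} |r_0(t,\omega)|^2\, dt}{\int_{t_0}^{t_1} |r_0(t,\omega)|^2\, dt} = |h(\omega)|.$$

Under the same assumptions, the triangle inequality |x| ≤ |s| + |h| |r_0| indicates what the default threshold α = 2 implies for a point that passes the mask test |x(t, ω)| > 2 a(ω) |r_0(t, ω)|:

$$|s(t,\omega)| \;\ge\; |x(t,\omega)| - |h(\omega)|\,|r_0(t,\omega)| \;>\; |h(\omega)|\,|r_0(t,\omega)|.$$

That is, with α = 2 and an exact estimate a(ω) = |h(ω)|, every retained time-frequency point is one at which the desired signal is stronger than the filtered unwanted signal.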

An alternate embodiment of the method described herein will now be presented. 
Referring now to Figure 1, a recorded mixture signal x and a played unwanted signal r_0 
are acquired (at 105). The goal of the method described herein, as previously stated, is to 
produce a desired signal s from the recorded mixture x. Referring now to Figure 2, a 
sample reading 200 is shown. The sample reading 200 comprises time domain 
representations 205 of the mixture signal x 210 and the unwanted signal r_0 215. It is 
understood that the pictorial time domain representations 205 of various signals described 
herein are only used for illustrative purposes. The method described herein may be 
implemented with or without creating the pictorial time domain representations 205. As 
illustrated in the present disclosure, the horizontal axis of the time domain representations 
205 represents a number of samples, and the vertical axis represents an amplitude of the 
signal. The number of samples depends on any of a variety of factors, including sampling 
frequency, hardware/software constraints, and user-defined constraints, as known to those 
skilled in the art. Similarly, the representation of amplitude may depend on any of a 
variety of factors, including hardware/software constraints and user-defined constraints. 

Referring again to Figure 1, the mixture signal and the unwanted signal are 
aligned (at 110). As shown by a pair of guide lines 305 in Figure 3, the mixture signal x 
210 and the unwanted signal r_0 215 of the sample reading 200 are misaligned by an 
estimated delay 310. The delay 310 can be estimated manually (e.g., through human 
optical inspection) or through cross-correlation. The unwanted signal r_0 is redefined, 
taking into account the delay 310 of Figure 3. As shown in Figure 4, r_1 represents a 
redefined unwanted signal 405 that is now at least substantially aligned (i.e., there may be 
error in estimating the delay 310) with the mixture signal x 210 of Figure 2 and Figure 3. 
The pictorial representation of the unwanted signal r_0 215 is shown in Figure 4 for 
comparative purposes. 
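
By way of illustration only, the cross-correlation option may be sketched as follows for sampled signals. This is a minimal sketch, not the implementation of the disclosure: the function names `estimate_delay` and `redefine`, the use of NumPy, and the synthetic test signals are all assumptions made for the example.

```python
import numpy as np

def estimate_delay(x, r0):
    """Estimate the lag (in samples) of r0 inside x from the cross-correlation peak."""
    corr = np.correlate(x, r0, mode="full")
    # For mode="full", index i of corr corresponds to lag i - (len(r0) - 1).
    return int(np.argmax(corr)) - (len(r0) - 1)

def redefine(r0, delay, length):
    """Build r1: r0 shifted by `delay` samples, zero padded, cut to `length` samples."""
    r1 = np.zeros(length)
    src = np.arange(len(r0))
    dst = src + delay
    keep = (dst >= 0) & (dst < length)   # drop samples shifted outside the output
    r1[dst[keep]] = r0[src[keep]]
    return r1

# Hypothetical data: the mixture holds a scaled copy of r0 delayed by 37 samples,
# plus a small amount of unrelated noise.
rng = np.random.default_rng(0)
r0 = rng.standard_normal(500)
x = 0.05 * rng.standard_normal(600)
x[37:537] += 0.5 * r0

delay = estimate_delay(x, r0)
r1 = redefine(r0, delay, len(x))
```

Only a rough alignment is sought here; the peak of the raw cross-correlation is one simple choice, and a manual estimate of the delay could be passed to `redefine` in the same way.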

Referring again to Figure 1, time-frequency representations are computed (at 
120). Referring now to Figure 5, pictorial time-frequency representations 500 are shown 
for the mixture signal x 505 and the redefined unwanted signal r_1 510. As with the time 
domain representations 205, the pictorial time-frequency representations 500 presented 
herein are shown solely for illustrative purposes. The method described herein may be 
implemented with or without the pictorial time-frequency representations 500. As 
illustrated in the present disclosure, the horizontal axis of the time-frequency 
representations 500 represents a number of samples, and the vertical axis represents a 
frequency (in Hz) of the signal. 
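
By way of illustration only, a time-frequency representation of a sampled signal may be computed with a windowed discrete Fourier transform as sketched below. The frame length, hop size, periodic Hann window, and the name `stft` are assumptions chosen for the example; the disclosure does not fix any of them.

```python
import numpy as np

def stft(signal, frame_len=256, hop=64):
    """Windowed DFT: rows are frequency bins, columns are time frames."""
    # Periodic Hann window (an assumed choice of window).
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

# Hypothetical test signal: a 1 kHz tone sampled at 8 kHz.
fs = 8000
t = np.arange(4096) / fs
x = np.sin(2 * np.pi * 1000 * t)

X = stft(x)
freqs = np.fft.rfftfreq(256, d=1 / fs)           # centre frequency of each row, in Hz
peak_hz = freqs[np.argmax(np.abs(X[:, 10]))]     # strongest bin in one frame
```

The same function would be applied to the mixture signal and to the redefined unwanted signal so that both representations share one time-frequency grid.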

Referring again to Figure 1, a segment of time is determined (at 125) when only 
the redefined unwanted signal r_1 405 of Figure 4 is present in the mixture signal x 210 of 
Figure 2 and Figure 3. As shown in Figure 6, the segment 605 represented by the time 
interval (t_0, t_1) illustrates a segment of time when only the redefined unwanted signal r_1 405 
is present in the mixture signal x 210. In other words, this is the segment of time when 
the desired signal is not of a sufficient auditory level to be heard by a human or does not 
exist. 

Referring again to Figure 1, the value a(ω) (i.e., modulus of the filter h(ω)) is 
computed (at 130) from the time-frequency representations 500 of the mixture signal x 
505 and the redefined unwanted signal r_1 510 of Figure 5. The value a(ω) can be 
computed with the following equation, as described in greater detail above: 

$$a(\omega) = \frac{\int_{t \in (t_0, t_1)} |x(t,\omega)\, r_1(t,\omega)|\, dt}{\int_{t \in (t_0, t_1)} |r_1(t,\omega)|^2\, dt}$$

As shown herein, a(ω) = |h(ω)|. Referring now to Figure 7, the value a(ω) 705 is 
illustrated with respect to the time-frequency representations 500 of the mixture signal x 
505 and the redefined unwanted signal r_1 510 of Figure 5. 
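
By way of illustration only, in discrete time the integrals become sums over the frames that fall inside the selected segment. The sketch below assumes spectrograms stored as NumPy arrays with one row per frequency and one column per frame; the name `estimate_a`, the frame-index arguments, and the small constant `eps` are assumptions of the example.

```python
import numpy as np

def estimate_a(X, R1, first_frame, last_frame, eps=1e-12):
    """Per-frequency ratio sum|X * R1| / sum|R1|^2 over frames [first_frame, last_frame)."""
    Xw = X[:, first_frame:last_frame]
    Rw = R1[:, first_frame:last_frame]
    numerator = np.sum(np.abs(Xw * Rw), axis=1)
    denominator = np.sum(np.abs(Rw) ** 2, axis=1)
    return numerator / (denominator + eps)   # eps (assumed) guards against empty bins

# Hypothetical check: when the mixture equals a filtered copy of r1 in the segment,
# the estimate should return the modulus of the filter and ignore its phase.
rng = np.random.default_rng(1)
R1 = rng.standard_normal((4, 50)) + 1j * rng.standard_normal((4, 50))
h = np.array([0.5, 1.0, 2.0, 0.25]) * np.exp(1j * np.array([0.3, -1.2, 2.0, 0.0]))
X = h[:, None] * R1

a = estimate_a(X, R1, 10, 40)
```

In this synthetic case `a` is close to the moduli 0.5, 1.0, 2.0 and 0.25 regardless of the phases assigned to `h`.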
Referring again to Figure 1, a time-frequency mask is generated (at 135). The 
time-frequency mask can be generated using the following equation, as described in 
greater detail above: 

$$m(t,\omega) = \begin{cases} 1 & \text{if } \dfrac{|x(t,\omega)|}{a(\omega)\,|r_1(t,\omega)|} > \alpha \\ 0 & \text{if otherwise} \end{cases}$$

Referring now to Figure 8, a pictorial representation of a time-frequency mask 800 
consistent with the present embodiment is shown. The resulting time-frequency mask 
800 can have a value of 0 or 1, depending on the time-frequency point. The lighter 
time-frequency points of the time-frequency mask 800 represent a 1 value. The darker 
time-frequency points of the time-frequency mask 800 represent a 0 value. 
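
By way of illustration only, a binary mask of this kind may be computed as sketched below. The sketch assumes that a point is kept when the mixture magnitude exceeds α times the weighted magnitude of the redefined unwanted signal, with the default α = 2; the name `binary_mask` and the constant `eps` are assumptions of the example.

```python
import numpy as np

def binary_mask(X, R1, a, alpha=2.0, eps=1e-12):
    """Return 1.0 where |X| / (a * |R1|) exceeds alpha, and 0.0 elsewhere."""
    ratio = np.abs(X) / (a[:, None] * np.abs(R1) + eps)   # eps avoids division by zero
    return (ratio > alpha).astype(float)

# Hypothetical 2-frequency, 2-frame example.
X = np.array([[3.0, 1.0],
              [0.5, 10.0]])
R1 = np.ones((2, 2))
a = np.array([1.0, 2.0])

mask = binary_mask(X, R1, a)   # ratios are [[3, 1], [0.25, 5]]
```

Raising `alpha` suppresses more points and lowering it retains more, which is the trade-off adjusted when the threshold is tuned for intelligibility.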

Referring again to Figure 1, the time-frequency mask 800 of Figure 8 is applied 
(at 140) on the mixture signal x 505 of Figure 5 and the value s = x · mask is 
computed (at 140). Referring now to Figure 9, a pictorial representation 900 of the 
mixture signal x 505 of Figure 5 after the time-frequency mask 800 of Figure 8 is 
applied is shown. As illustrated, the lighter time-frequency points represent a |x| value of 1 
(i.e., |x| = 1), and the darker time-frequency points represent a 0 value (i.e., |x| = 0). 

Referring again to Figure 1, the value s is inverted (at 145) into a time domain to 
obtain an estimate of a desired signal. Inversion is well known to those skilled in the art. 
In one embodiment, the following equation, 



may be inverted. The result of computing the inverted equation is inverting s into the 
time domain. Referring now to Figure 10, a pictorial time domain representation of the 
desired signal s 1000 is illustrated. 
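
By way of illustration only, one common way to invert a masked windowed DFT is weighted overlap-add, sketched below. This is an assumed technique chosen for the example rather than the inversion specified by the disclosure; the frame length, hop size, periodic Hann window, and the names `stft` and `istft` are likewise assumptions.

```python
import numpy as np

FRAME, HOP = 256, 64
WINDOW = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(FRAME) / FRAME)  # periodic Hann

def stft(signal):
    """Windowed DFT with rows as frequency bins and columns as frames."""
    n_frames = 1 + (len(signal) - FRAME) // HOP
    frames = np.stack([signal[i * HOP:i * HOP + FRAME] * WINDOW
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft(spec, length):
    """Weighted overlap-add inverse of stft(), normalised by the summed squared window."""
    frames = np.fft.irfft(spec.T, n=FRAME, axis=1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        start = i * HOP
        out[start:start + FRAME] += frame * WINDOW
        norm[start:start + FRAME] += WINDOW ** 2
    return out / np.maximum(norm, 1e-8)   # floor avoids dividing by zero at the edges

# Hypothetical check: an all-pass mask should return the mixture itself.
rng = np.random.default_rng(2)
x = rng.standard_normal(4096)
X = stft(x)
mask = np.ones(X.shape)
s_hat = istft(mask * X, len(x))
```

Away from the first and last frame the reconstruction with an all-pass mask matches `x`; with a binary mask from the preceding step, `s_hat` would be the estimate of the desired signal.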

Although the embodiments illustrated herein show continuous time signals, it is 
understood that the present invention can be applied to sampled signals. In discrete time, 
the windowed Fourier transform would be a windowed DFT (discrete Fourier 
transform) and the estimates of the filter |h(ω)| would be finite sums over discrete time 
points for each frequency center. In another embodiment, the windowed Fourier 
transform can be replaced by a wavelet transform, which is a time-scale representation 
defined by: 



The present invention differs from classical Widrow-Hoff techniques. By its 
design, the Widrow-Hoff algorithm estimates h(ω), and then, once estimated, the 
algorithm uses h(ω) to subtract a filtered-by-h signal r_0 from x: x − h*r_0. Conversely, the 
method described herein uses only the modulus of h(ω), and therefore only the modulus 
of h is needed. As previously stated, the modulus of h(ω) (i.e., |h(ω)|) is denoted by 
a(ω). Accordingly, the present invention does not estimate the phase but is based on 
instantaneous time-frequency magnitude estimates. As a result, the present invention is 
more robust to alignment errors than Widrow-Hoff techniques. 
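
By way of illustration only, the sensitivity of waveform subtraction to misalignment, and the insensitivity of spectral magnitudes, can be seen in an idealised numerical sketch. The sketch does not implement the Widrow-Hoff algorithm; it assumes a single frame of white noise and a circular shift of three samples, for which the magnitudes are exactly preserved. For windowed frames of real recordings the magnitudes would change slightly rather than not at all.

```python
import numpy as np

rng = np.random.default_rng(3)
frame = rng.standard_normal(256)
shifted = np.roll(frame, 3)          # circular misalignment of 3 samples

spec = np.fft.rfft(frame)
spec_shifted = np.fft.rfft(shifted)

# The shift only rotates the phase of each bin, so the magnitudes agree.
magnitude_error = np.max(np.abs(np.abs(spec) - np.abs(spec_shifted)))

# Subtracting the misaligned waveform leaves a residual comparable in energy
# to the signal itself instead of cancelling it.
residual_ratio = np.sum((frame - shifted) ** 2) / np.sum(frame ** 2)
```

Here `magnitude_error` is at the level of floating-point rounding, while `residual_ratio` is of order one, which is the contrast the paragraph above describes.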

In an alternate embodiment of the present invention, time varying filter estimates 
(i.e., adaptive updates to a(ω)) may be implemented. This would require a 
manual segmentation of the data. More specifically, the data (i.e., the two recordings x 
and r) are split into segments of a particular time interval (e.g., five minutes). The 
method described herein is applied to each segment. In yet another embodiment of the 
present invention, the value of a(ω) may be set to 1. 
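
By way of illustration only, segment-wise processing may be organised as sketched below. The helper `process_in_segments` and the stand-in function `placeholder_method` are hypothetical; in practice the stand-in would be replaced by the per-segment method (estimation of a(ω), masking, and inversion), and the two recordings are assumed to be aligned and of equal length.

```python
import numpy as np

def process_in_segments(x, r, segment_len, process):
    """Apply process(x_segment, r_segment) to consecutive segments and concatenate."""
    pieces = []
    for start in range(0, len(x), segment_len):
        stop = start + segment_len
        pieces.append(process(x[start:stop], r[start:stop]))
    return np.concatenate(pieces)

segment_lengths = []

def placeholder_method(x_seg, r_seg):
    """Hypothetical stand-in: records the segment length and returns the mixture unchanged."""
    segment_lengths.append(len(x_seg))
    return x_seg.copy()

x = np.arange(10.0)
r = np.ones(10)
out = process_in_segments(x, r, 4, placeholder_method)
```

Because each segment is handled independently, a separate estimate of a(ω) is obtained per segment, which is what allows the filter estimate to vary over time.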

In an alternate embodiment of the present invention, the original recording r_0(t) is 
recorded in the same environment/set-up as the recorded mixture x(t). For example, this 
can be done by using the same recording device for recording the mixture (e.g., cassette 
tape recorder) and the same playing device for playing the unwanted signal (e.g., a CD 
player). The recording device and the playing device would be placed in approximately 
the same physical location in a room of similar geometric structure and materials. The 
recording device records the original recording r_0(t) being played by the playing device. 
The original recording r_0(t) is used to compute an estimate of |r(t, ω)|. That is, the 
original recording r_0(t) would serve the role of a(ω) r(t, ω) in the time-frequency mask 
generation. 

In an alternate embodiment of the present invention, the following time-frequency 
mask may be used: 

where β is set to maximize intelligibility of the output signal. A default choice of β can 
be determined from statistics of a(ω) r(t, ω) and x(t, ω). 

The particular embodiments disclosed above are illustrative only, as the invention 
may be modified and practiced in different but equivalent manners apparent to those 
skilled in the art having the benefit of the teachings herein. Furthermore, no limitations 
are intended to the details of construction or design herein shown, other than as described 
in the claims below. It is therefore evident that the particular embodiments disclosed 
above may be altered or modified and all such variations are considered within the scope 
and spirit of the invention. Accordingly, the protection sought herein is as set forth in the 
claims below. 